-
-
Notifications
You must be signed in to change notification settings - Fork 54
[Store] Add a way to store document content when using Chroma DB #288
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[Store] Add a way to store document content when using Chroma DB #288
Conversation
IIRC we had something like this in the past, did we @chr-hertel ? Other than that, it could be a good idea I guess. |
Other than that, with #57 I implemented a So, my theory here: The Idea
(instead a simple ID we could also think of a more sophisticated refernece object or sth, but i'd be lazy on that) Now |
Its the original content, the string from which the embedding vectors are generated.
Looking at #57, I can see that the The I think that what concerns me a bit is that there are now kind-of standardized metadata fields getting introduced, on which the individual store implementations need to start to rely on if they want to implement certain feature (for example, take the VectorDocument->metadata->text and use it to generate documents when batch storing data into Chroma DB). The
Then those should probably be enumerated in some Alternative would be to have these (or some of these) as actual fields on the
And have them more strictly typed in this way, keeping the I am not sure which approach is better overall, but |
@chr-hertel Should I try to go for adding of |
So there's actually a third alternative, that might combine both worlds: I think that's currently my favorite after looking at this - we need to think about other stores as well - what do you think about that? |
Sounds good! 👍 So |
Yes, let's go with text - it relates more to |
ebee1c4
to
563efa2
Compare
563efa2
to
93ed5c6
Compare
@chr-hertel @OskarStark its ready for review, please let me know if I need to change anything.. 🙂 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR extends the Store component to support storing document content alongside vectors in Chroma DB by introducing standardized metadata keys and updating the ChromaDB store implementation.
- Adds structured metadata fields with constants for parent ID, text content, and source
- Updates ChromaDB store to extract text content from metadata and pass it as original documents
- Standardizes metadata key usage across loaders and transformers with underscore-prefixed constants
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
Show a summary per file
File | Description |
---|---|
src/store/src/Document/Metadata.php |
Adds constants and methods for standardized metadata keys (_parent_id, _text, _source) |
src/store/src/Bridge/ChromaDb/Store.php |
Modifies add() method to extract text content and pass as original documents to ChromaDB |
src/store/src/Document/Loader/TextFileLoader.php |
Updates to use Metadata::KEY_SOURCE constant instead of hardcoded 'source' |
src/store/src/Document/Transformer/TextSplitTransformer.php |
Updates to use Metadata constants for parent_id and text keys |
src/store/tests/Document/MetadataTest.php |
Comprehensive test coverage for new Metadata functionality |
src/store/tests/Bridge/ChromaDb/StoreTest.php |
Updates tests to verify text extraction and original document handling |
src/store/tests/Document/Loader/TextFileLoaderTest.php |
Updates test assertions to check both key formats |
src/store/tests/Document/Transformer/TextSplitTransformerTest.php |
Updates test assertions for new metadata key format |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
Not really impressed by copilots review I must say ... 😞 |
return $this->offsetExists(self::KEY_TEXT); | ||
} | ||
|
||
public function setText(?string $text): void |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it need to be nullable?
public function setText(?string $text): void | |
public function setText(string $text): void |
: null; | ||
} | ||
|
||
public function setSource(?string $source): void |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it need to be nullable?
public function setSource(?string $source): void | |
public function setSource(string $source): void |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comments, but looks good to me already 👍
This is just a draft for now to explain what I am trying to achieve - I would like to be able to store the original document content in the Chroma DB records, but the
VectorDocument
currently has no way to pass that in, other than the metadata. I feel like this requirement will not be specific just to Chroma DB, but various databases used for embeddings will support storing of the original document content along with the vectors.Any suggestions on how to approach this holistically? Should the
VectorDocument
be expanded with optional field like?string $content = null
?Additionally (see https://symfony.com/releases):
-->